Colombia COVID-19 - Central region
library(MASS)
library(readr)
library(dplyr)
library(ggplot2)
library(bayesplot)
library(RColorBrewer)
library(leaflet)
library(geojsonio)
library(htmltools)
library(htmlwidgets)
library(rstan)
Colombia COVID-19
LINK: https://www.kaggle.com/camesruiz/colombia-covid19-complete-dataset
DESCRIPTION: Coronavirus (COVID-19) broke out in Colombia with the first case confirmed in the country on March 6th. Since then, the number of confirmed cases has been increasing, and the first deaths related to the virus have been confirmed. This dataset contains complete information about confirmed cases, deaths and the number of recovered patients according to the daily reports of the Colombian health department (Ministerio de Salud).
GOAL: Build a model for the number of confirmed cases in the different Colombian regions. You have access to some covariates, such as: Edad (age), Sexo (sex), País de procedencia (country of origin) of the individual cases. Try to test the predictive accuracy of your selected model.
ATTENTION: Three countries are here considered: Colombia, Mexico and India. Each different group of students should focus on a geographical sub-area of one of the three countries, say the northern, the central or the southern part of the countries, by pooling all the regions/states/departments belonging to the considered area. Say: group A focuses on Northern Mexico, group B on Central Mexico, and so on. The distinction in northern, central and southern is not strict, you have some flexibility.
Our Project
We decided to do central Colombia because it is where the capital is.
The largest cities in the country are Bogotá (in the center), Medellín (in the north, close to central), Cali (in the center) and Barranquilla (extreme north).
Dataset - First Exploration
colombia_covid <- read_csv("data/colombia_covid.csv")
unique(colombia_covid$`Departamento o Distrito`)
## [1] "Bogotá D.C." "Valle del Cauca"
## [3] "Antioquia" "Cartagena D.T. y C"
## [5] "Huila" "Meta"
## [7] "Risaralda" "Norte de Santander"
## [9] "Caldas" "Cundinamarca"
## [11] "Barranquilla D.E." "Santander"
## [13] "Quindío" "Tolima"
## [15] "Cauca" "Santa Marta D.T. y C."
## [17] "Cesar" "San Andrés"
## [19] "Casanare" "Nariño"
## [21] "Atlántico" "Boyacá"
## [23] "Córdoba" "Bolívar"
## [25] "Sucre" "La Guajira"
# options(tibble.print_max = Inf)  # to show all the elements of the list,
# but set it also for the other chunks
# options(tibble.width = Inf)
colombia_covid %>% group_by(`Departamento o Distrito`) %>% count()
## # A tibble: 26 x 2
## # Groups: Departamento o Distrito [26]
## `Departamento o Distrito` n
## <chr> <int>
## 1 Antioquia 127
## 2 Atlántico 4
## 3 Barranquilla D.E. 31
## 4 Bogotá D.C. 542
## 5 Bolívar 3
## 6 Boyacá 6
## 7 Caldas 16
## 8 Cartagena D.T. y C 39
## 9 Casanare 2
## 10 Cauca 12
## # ... with 16 more rows
Colombia is divided into 32 departments. According to Wikipedia we are missing the Departments of Amazonas, Arauca, Caquetá, Chocó, Guainía, Guaviare, Magdalena, Putumayo, Vaupés, Vichada.
Bogotá, Distrito Capital, is in the Cundinamarca Department. Barranquilla D.E. is a “Distrito Especial” but should be in the Atlántico Department.
The Districts (Spanish: Distrito) in Colombia are cities that have a feature that highlights them, such as its location and trade, history or tourism. Arguably, the districts are special municipalities. The districts are Bogotá, Barranquilla, Cartagena, Santa Marta, Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.
We are missing Cúcuta, Popayán, Tunja, Buenaventura, Turbo and Tumaco.
# lat-long; each vector: name, latitude, longitude, number of cases
# bogota <- c(4.592164298, -74.072166378, 542)
valle_de_cauca <- c("Valle del Cauca", 3.43722, -76.5225, 150)
cauca <- c("Cauca", 2.43823, -76.6131592, 12)
antioquia <- c("Antioquia", 6.25184, -75.56359, 127)
cartagena <- c("Cartagena D.T. y C", 10.39972, -75.51444, 39)
huila <- c("Huila", 2.9273, -75.2818909, 30)
meta <- c("Meta", 4.142, -73.62664, 12)
risaralda <- c("Risaralda", 4.8133302, -75.6961136, 35)
norte_santander <- c("Norte de Santander", 7.89391, -72.50782, 21)
caldas <- c("Caldas", 5.06889, -75.51738, 16)
cudinamarca <- c("Cundinamarca", 4.862437, -74.058655, 42)
barraquilla <- c("Barranquilla D.E.", 10.96854, -74.78132, 35) #atlantico
santader <- c("Santander", 7.12539, -73.1198, 12)
quindio <- c("Quindío", 4.535, -75.67569, 23)
tolima <- c("Tolima", 4.43889, -75.2322235, 14)
santa_marta <- c("Santa Marta D.T. y C.", 11.24079, -74.19904, 12)
cesar <- c("Cesar", 10.46314, -73.25322, 16)
san_andres <- c("San Andrés", 12.5847197, -81.7005615, 2)
casanare <- c("Casanare", 5.33775, -72.39586, 2)
narino <- c("Nariño", 1.21361, -77.28111, 6)
boyaca <- c("Boyacá", 5.767222, -72.940651, 6)
cordoba <- c("Córdoba", 8.74798, -75.88143, 2)
bolivar <- c("Bolívar", 10.39972, -75.51444, 3)
sucre <- c("Sucre", 9.3047199, -75.3977814, 1)
guajira <- c("La Guajira", 11.54444, -72.90722, 1)
gis_data <- data.frame(name = "Bogotá D.C.", latitude = 4.624335, longitude = -74.063644,
cases = 542) #bogotà
gis_data$name <- as.character(gis_data$name)
gis_data <- rbind(gis_data, cauca)
gis_data <- rbind(gis_data, valle_de_cauca)
gis_data <- rbind(gis_data, antioquia)
gis_data <- rbind(gis_data, cartagena)
gis_data <- rbind(gis_data, huila)
gis_data <- rbind(gis_data, meta)
gis_data <- rbind(gis_data, risaralda)
gis_data <- rbind(gis_data, norte_santander)
gis_data <- rbind(gis_data, caldas)
gis_data <- rbind(gis_data, cudinamarca)
gis_data <- rbind(gis_data, barraquilla)
gis_data <- rbind(gis_data, santader)
gis_data <- rbind(gis_data, quindio)
gis_data <- rbind(gis_data, tolima)
gis_data <- rbind(gis_data, santa_marta)
gis_data <- rbind(gis_data, cesar)
gis_data <- rbind(gis_data, san_andres)
gis_data <- rbind(gis_data, casanare)
gis_data <- rbind(gis_data, narino)
gis_data <- rbind(gis_data, boyaca)
gis_data <- rbind(gis_data, cordoba)
gis_data <- rbind(gis_data, bolivar)
gis_data <- rbind(gis_data, sucre)
gis_data <- rbind(gis_data, guajira)
gis_data$latitude <- as.numeric(gis_data$latitude)
gis_data$longitude <- as.numeric(gis_data$longitude)
gis_data$cases <- as.numeric(gis_data$cases)

getColor <- function(gis_data) {
sapply(gis_data$cases, function(cases) {
if (cases <= 10) {
"green"
} else if (cases <= 100) {
"orange"
} else {
"red"
}
})
}
icons <- awesomeIcons(icon = "ios-close", iconColor = "black", library = "ion",
markerColor = getColor(gis_data))
dept <- geojsonio::geojson_read("data/Colombia.geo.json", what = "sp")
labels <- sprintf("<strong>%s</strong><br/>", dept$NOMBRE_DPT) %>% lapply(htmltools::HTML)
m <- leaflet(data = gis_data) %>% addTiles() %>% addAwesomeMarkers(~longitude,
~latitude, icon = icons, label = ~as.character(cases)) %>% addProviderTiles(providers$CartoDB.Positron) %>%
addPolygons(data = dept, opacity = 0.2, color = "yellow", dashArray = "3",
fillOpacity = 0.1, highlight = highlightOptions(weight = 5, color = "#666",
dashArray = "", fillOpacity = 0.7, bringToFront = TRUE), label = labels)
m

The color of the pins reflects the number of cases: 10 or fewer is “green”, between 11 and 100 is “orange”, otherwise it is “red”.
On the map there are all the cities/departments for which we have data. We can notice that we don’t have any data in the south of the country.
Reading here and there, I found that Colombia is divided into 5 regions; the central one comprises Boyacá, Tolima, Cundinamarca, Meta and Bogotá D.C.
ANGELA: Looking at Wikipedia, I think that the Orinoquía Region (Meta, Arauca, Casanare and Vichada Departments) is in the center, so I would also add Arauca, Casanare and Vichada. I noticed that we only have Casanare; the other two don’t have data.
GAIA: added Casanare, for the others we have no data!
However, since in our assignment Colombia is divided into 3 parts, I think that we should add some more regions (e.g. Quindío, Valle del Cauca, Risaralda, Caldas, Boyacá and possibly Antioquia and Santander).
# slice the main dataset
central.colombia.dep <- c("Bogotá D.C.", "Tolima", "Cundinamarca", "Meta", "Boyacá",
    "Quindío", "Cauca", "Valle del Cauca", "Risaralda", "Caldas",
    "Antioquia", "Santander", "Casanare")
central.colombia.rows <- which(colombia_covid$`Departamento o Distrito` %in%
central.colombia.dep)
colombia_covid <- colombia_covid[central.colombia.rows, ]
# slice the gis_data dataset
central_gis_data <- gis_data[which(gis_data$name %in% central.colombia.dep),
]
# slice the geojson dataset
central_dept <- c("SANTAFE DE BOGOTA D.C", "TOLIMA", "CUNDINAMARCA", "META",
    "BOYACA", "QUINDIO", "CAUCA", "VALLE DEL CAUCA", "RISARALDA", "CALDAS",
    "ANTIOQUIA", "SANTANDER", "CASANARE")
central.dept <- dept[which(dept@data$NOMBRE_DPT %in% central_dept), ]

getColor <- function(central_gis_data) {
sapply(central_gis_data$cases, function(cases) {
if (cases <= 10) {
"green"
} else if (cases <= 100) {
"orange"
} else {
"red"
}
})
}
icons <- awesomeIcons(icon = "ios-close", iconColor = "black", library = "ion",
markerColor = getColor(central_gis_data))
# now colombia is yellow and our departments are red
central_map <- leaflet(data = central_gis_data) %>% addTiles() %>% addAwesomeMarkers(~longitude,
~latitude, icon = icons, label = ~htmlEscape(paste(name, cases, sep = " : "))) %>%
addProviderTiles(providers$CartoDB.Positron) %>% addPolygons(data = dept,
opacity = 0.1, color = "yellow", fillOpacity = 0.1) %>% addPolygons(data = central.dept,
opacity = 0.2, color = "red", dashArray = "3", fillOpacity = 0.1)
central_map

Some very basic plots
Let’s check the situation (and also the power of ggplot)!
Scattered info about the pandemic in Colombia (https://en.wikipedia.org/wiki/COVID-19_pandemic_in_Colombia):
the quarantine started on the 20th of March; since our data run from the 6th of March to the 2nd of April, it is very likely that quarantine effects are not yet visible in our data.
on March the 26th there was a breakdown of the machine that prepared the samples for processing and subsequent diagnosis of COVID-19, which affected the speed at which results were being produced. This could explain the very low number of confirmed cases on that day.
theme_set(theme_classic())
region<-colombia_covid$`Departamento o Distrito`
nrows<-10
df <- expand.grid(y = 1:nrows, x = 1:nrows)
categ_table <- round(table(region) * ((nrows*nrows+1)/(length(region))))
df$category<-factor(rep(names(categ_table), categ_table))
waffle_chart <- ggplot(df, aes(x = x, y = y, fill = category)) +
geom_tile(color = "black", size = 0.5) +
scale_x_continuous(expand = c(0, 0)) +
scale_y_continuous(expand = c(0, 0), trans = 'reverse') +
scale_fill_brewer(palette = "Set3") +
labs(title="Frequency of cases across Departments", subtitle="Waffle Chart") +
theme(#panel.border = element_rect(size = 2),
plot.title = element_text(size = rel(1.2)),
axis.text = element_blank(),
axis.title = element_blank(),
axis.ticks = element_blank(),
legend.title = element_blank(),
legend.position = "right")
waffle_chart

The majority of cases are in the capital Bogotá.
theme_set(theme_classic())
# number of cases confirmed day by day
# fix day column in 'international' format so that R can fix the sorting
# properly
colombia_covid$`Fecha de diagnóstico` <- as.Date(colombia_covid$`Fecha de diagnóstico`,
format = "%d/%m/%Y")
colorCount <- length(unique(colombia_covid$`Departamento o Distrito`))
getPalette <- colorRampPalette(brewer.pal(9, "Spectral"))(colorCount)
daily_confirmed <- ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) +
scale_fill_manual(values = getPalette)
daily_confirmed + geom_histogram(aes(fill = `Departamento o Distrito`), width = 0.8,
stat = "count") + theme(axis.text.x = element_text(angle = 65, vjust = 0.6),
legend.position = "right") + labs(title = "Daily number of confirmed cases",
subtitle = "subdivided across departments", x = "Date of confirmation",
fill = "Department")

The previous plot represents the daily incidence of the disease across all the departments we are taking into account.
Let’s check the general trend by looking at the cumulative number of confirmed cases (again, all “our” departments are taken into account):
colombia_covid$`ID de caso` <- 1:dim(colombia_covid)[1]
# the max of the ID de caso for each date is the cumulative number of cases
# confirmed up to that date
cumulative <- as.data.frame(colombia_covid %>% group_by(`Fecha de diagnóstico`) %>%
summarise(max(`ID de caso`)))
names(cumulative) <- c("Fecha de diagnóstico", "Cumulative confirmed")
cumulative
## Fecha de diagnóstico Cumulative confirmed
## 1 2020-03-06 1
## 2 2020-03-09 3
## 3 2020-03-11 8
## 4 2020-03-12 10
## 5 2020-03-13 13
## 6 2020-03-14 21
## 7 2020-03-15 34
## 8 2020-03-16 46
## 9 2020-03-17 58
## 10 2020-03-18 82
## 11 2020-03-19 99
## 12 2020-03-20 144
## 13 2020-03-21 169
## 14 2020-03-22 197
## 15 2020-03-23 259
## 16 2020-03-24 356
## 17 2020-03-25 404
## 18 2020-03-26 414
## 19 2020-03-27 457
## 20 2020-03-28 517
## 21 2020-03-29 593
## 22 2020-03-30 678
## 23 2020-03-31 758
## 24 2020-04-01 899
## 25 2020-04-02 993
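The chunk above relies on the rows being sorted, so that the maximum `ID de caso` per date equals the running total. An order-independent alternative is to cumulate the daily counts; a sketch on a toy stand-in for `colombia_covid` (hypothetical dates, one row per case):

```r
library(dplyr)

# toy stand-in for colombia_covid: one row per confirmed case
toy <- data.frame(
  `Fecha de diagnóstico` = as.Date(c("2020-03-06", "2020-03-09", "2020-03-09",
                                     "2020-03-11", "2020-03-11", "2020-03-11")),
  check.names = FALSE
)

# daily counts, then an order-independent cumulative sum
cumulative_alt <- toy %>%
  count(`Fecha de diagnóstico`, name = "daily") %>%
  arrange(`Fecha de diagnóstico`) %>%
  mutate(`Cumulative confirmed` = cumsum(daily))

cumulative_alt
# daily = 1, 2, 3 and Cumulative confirmed = 1, 3, 6
```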
cumulative_confirmed <- ggplot(cumulative, aes(x = `Fecha de diagnóstico`, y = `Cumulative confirmed`))
cumulative_confirmed + geom_point(size = 3) + geom_segment(aes(x = `Fecha de diagnóstico`,
xend = `Fecha de diagnóstico`, y = 0, yend = `Cumulative confirmed`)) +
labs(title = "Cumulative number of confirmed cases", x = "Date of confirmation") +
theme(axis.text.x = element_text(angle = 65, vjust = 0.6))

Here the growth seems exponential (and this is consistent with the fact that we are studying the early stages of the outbreak).
In order to confirm it we should fit a log-linear model, and check that it produces a constant growth rate (straight line).
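On the cumulative counts above this check is easy to run; a sketch (values copied from the cumulative table, and only indicative since cumulative counts are autocorrelated):

```r
# cumulative confirmed cases and days elapsed, copied from the table above
y <- c(1, 3, 8, 10, 13, 21, 34, 46, 58, 82, 99, 144, 169, 197, 259,
       356, 404, 414, 457, 517, 593, 678, 758, 899, 993)
t <- c(0, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27)

# log-linear model: log(y_t) = a + r * t, where r is the daily growth rate
fit <- lm(log(y) ~ t)
growth_rate <- unname(coef(fit)[2])

round(growth_rate, 3)    # estimated daily exponential growth rate
summary(fit)$r.squared   # close to 1 if the growth is (roughly) exponential
```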
Now let’s explore the distribution of cases across genders and age:
library(ggthemes)
brks <- seq(-250, 250, 50)
lbls <- as.character(c(seq(-250, 0, 50), seq(50, 250, 50)))
ggplot(data = colombia_covid, aes(x = `Departamento o Distrito`, fill = Sexo)) +
geom_bar(data = subset(colombia_covid, Sexo == "F")) + geom_bar(data = subset(colombia_covid,
Sexo == "M"), aes(y = ..count.. * (-1))) + scale_y_continuous(breaks = brks,
labels = lbls) + coord_flip() + labs(title = "Spread of the disease across genders",
y = "Number of cases", x = "Department", fill = "Gender") + theme_tufte() +
theme(plot.title = element_text(hjust = 0.5), axis.ticks = element_blank()) +
scale_fill_brewer(palette = "Dark2")

In order to study the distribution of ages, maybe we should divide the ages into groups, for example 0-18, 19-30, 31-45, 46-60, 61-75, 76+.
# create column age_group (didn't modify the original dataset though)
age_groups <- colombia_covid %>% mutate(age_group = case_when(Edad <= 18 ~ "0-18",
Edad >= 19 & Edad <= 30 ~ "19-30", Edad >= 31 & Edad <= 45 ~ "31-45", Edad >=
46 & Edad <= 60 ~ "46-60", Edad >= 61 & Edad <= 75 ~ "61-75", Edad >=
76 ~ "76+"))
# compute percentage so that we can label more precisely the pie chart
age_groups_pie <- age_groups %>% group_by(age_group) %>% count() %>% ungroup() %>%
mutate(per = n/sum(n)) %>% arrange(desc(age_group))
age_groups_pie$label <- scales::percent(age_groups_pie$per)
library(ggrepel)
age_pie <- ggplot(age_groups_pie, aes(x = "", y = per, fill = factor(age_group))) +
geom_bar(stat = "identity", width = 1) + theme(axis.line = element_blank(),
plot.title = element_text(hjust = 0.5)) + labs(fill = "Age groups", x = NULL,
y = NULL, title = "Distribution of the disease across ages") + coord_polar(theta = "y") +
# geom_text(aes(x=1, y = cumsum(per) - per/2, label=label))
geom_label_repel(aes(x = 1, y = cumsum(per) - per/2, label = label), size = 3,
show.legend = F, nudge_x = 0) + guides(fill = guide_legend(title = "Group"))
age_pie

This is quite surprising... I expected elderly people to be more affected by Covid-19!
The overall life expectancy in Colombia at birth is 74.8 years (71.2 years for males and 78.4 years for females). Wikipedia
Instead, the median age of the population in 2015 was 29.5 years (30.4 in 2018, 31.3 in 2020), so it is a country full of young people! link or link or link
Now we can analyze jointly the distribution of age and sex (sex distribution across group of age):
# library(ggpol)
keep <- c("Sexo", "age_group")
age_groups <- age_groups[names(age_groups) %in% keep]
age_groups_count <- aggregate(cbind(pop = Sexo) ~ age_group + Sexo, data = age_groups,
FUN = function(x) {
NROW(x)
})
age_groups_count$count <- ifelse(age_groups_count$Sexo == "F", age_groups_count$pop *
-1, age_groups_count$pop)
age_groups_count <- as.data.frame(age_groups_count[names(age_groups_count) !=
"pop"])
ggplot(age_groups_count, aes(x = age_group, y = count, fill = Sexo)) + geom_col() +
# facet_share(~Sexo, dir = 'h') +
coord_flip(clip = "off") + theme_minimal() + labs(title = "Distribution of sex by age",
y = "", x = "Age group")

There isn’t much difference between the sexes among the different age groups; I have the impression that the covariates present in the dataset won’t help us!! :(
We are now left to explore the Tipo variable:
theme_set(theme_classic())
# renamed the attribute since I had some problems due to the presence of *
colnames(colombia_covid)[8] <- "tipo"
type <- ggplot(colombia_covid, aes(x = `Fecha de diagnóstico`)) + scale_fill_brewer(palette = "Set3")
type + geom_histogram(aes(fill = tipo), width = 0.8, stat = "count") + theme(axis.text.x = element_text(angle = 65,
vjust = 0.6)) + labs(title = "Daily number of confirmed cases", subtitle = "subdivided across type",
x = "Date of confirmation", fill = "Type")

I think that en estudio means that it is not clear whether the case is imported or not; however, it seems like there are more imported cases. We can count them:
type_pie <- colombia_covid %>% group_by(tipo) %>% count() %>% ungroup() %>%
mutate(per = n/sum(n)) %>% arrange(desc(tipo))
type_pie$label <- scales::percent(type_pie$per)
type_pie <- type_pie[names(type_pie) != "per"]
colnames(type_pie) <- c("tipo", "total_number", "percentage")
type_pie
## # A tibble: 3 x 3
## tipo total_number percentage
## <chr> <int> <chr>
## 1 Relacionado 291 29.3%
## 2 Importado 467 47.0%
## 3 En estudio 235 23.7%
Almost half of the total confirmed cases in our region of interest are imported, and for a significant percentage it is unknown whether they are imported or not. Again this is in some sense interesting, but I don’t see clearly why this should be helpful for our model!
imported <- subset(colombia_covid, tipo == "Importado", select = c("País de procedencia"))
imported %>% group_by(`País de procedencia`) %>% count()
## # A tibble: 55 x 2
## # Groups: País de procedencia [55]
## `País de procedencia` n
## <chr> <int>
## 1 0 1
## 2 Alemania 4
## 3 Alemania - Estambul 1
## 4 Arabia 1
## 5 Argentina 2
## 6 Aruba 2
## 7 Bélgica 1
## 8 Brasil 10
## 9 Canadá 1
## 10 Chile 2
## # ... with 45 more rows
Here the data are a bit dirty; however, I don’t know if the effort of cleaning them will be worth the result... it depends on whether we decide to use this info in our analysis.
Analyze the epidemic curve separately for each department, considering only those that have more than 30 cases:
covid_dp<-colombia_covid %>%
group_by(`Fecha de diagnóstico`,`Departamento o Distrito`) %>%
count()
names(covid_dp)<-c("date", "dep", "n")
#departments with more than 30 cases
relevant<-c("Bogotá D.C.", "Valle del Cauca", "Antioquia", "Cundinamarca", "Risaralda")
covid_relevant_dp<-subset(covid_dp, dep %in% relevant)
covid_relevant_dp
## # A tibble: 83 x 3
## date dep n
## <date> <chr> <int>
## 1 2020-03-06 Bogotá D.C. 1
## 2 2020-03-09 Antioquia 1
## 3 2020-03-09 Valle del Cauca 1
## 4 2020-03-11 Antioquia 3
## 5 2020-03-11 Bogotá D.C. 2
## 6 2020-03-12 Bogotá D.C. 2
## 7 2020-03-13 Bogotá D.C. 1
## 8 2020-03-13 Valle del Cauca 1
## 9 2020-03-14 Antioquia 3
## 10 2020-03-14 Bogotá D.C. 5
## # ... with 73 more rows
ggplot(covid_relevant_dp, aes(x=date, y=n, fill=dep)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_grid(dep~., scales="free_y") + #every level has a different count axis
theme_dark() +
theme(strip.text = element_text(face="bold", size=6))+
labs(title = "Epidemic curve for different departments",
subtitle = "considering only dept. with more than 30 cases",
x = "Date of confirmation",
y = "count",
fill = "Department")

Analyze the curve of cumulative confirmed cases for those “relevant” departments:
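The `cumulative_dep` column used in the next chunk is not computed in the code shown here; a plausible reconstruction is a per-department cumulative sum of the daily counts, sketched on a toy stand-in for `covid_relevant_dp` (hypothetical values):

```r
library(dplyr)

# toy stand-in for covid_relevant_dp: daily counts per department
toy <- data.frame(
  date = as.Date(c("2020-03-06", "2020-03-09", "2020-03-09", "2020-03-11")),
  dep  = c("Bogota", "Antioquia", "Bogota", "Bogota"),
  n    = c(1, 1, 2, 2)
)

# cumulative cases within each department, in date order
toy <- toy %>%
  arrange(dep, date) %>%
  group_by(dep) %>%
  mutate(cumulative_dep = cumsum(n)) %>%
  ungroup()

toy
# Antioquia: 1; Bogota: 1, 3, 5
```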
ggplot(covid_relevant_dp, aes(x=date, y=cumulative_dep, fill=dep)) +
geom_bar(stat="identity", show.legend = FALSE) +
facet_grid(dep~., scales="free_y") + #every level has a different count axis
theme_gray() +
theme(strip.text = element_text(face="bold", size=6))+
labs(title = "Cumulative number of cases for different departments",
subtitle = "considering only dept. with more than 30 cases",
x = "Date of confirmation",
y = "count",
fill = "Department")

We can see that (except for the Bogotá department) we have a lot of “missing” columns: these are the days in which no data was recorded for the corresponding department, so the cumulative number of cases is the same as on the previous day reported in the dataset. Maybe there is a way to fix this!
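One possible fix is to make the missing department-days explicit and carry the last cumulative value forward. A sketch with `tidyr::complete()` and `tidyr::fill()` (tidyr is not loaded above, so this adds a dependency; the toy data are hypothetical):

```r
library(dplyr)
library(tidyr)

# toy per-department daily counts with a gap: no Antioquia row on 03-08
toy <- data.frame(
  date = as.Date(c("2020-03-07", "2020-03-08", "2020-03-07", "2020-03-09")),
  dep  = c("Bogota", "Bogota", "Antioquia", "Antioquia"),
  n    = c(2, 3, 1, 4)
)

filled <- toy %>%
  group_by(dep) %>%
  arrange(date, .by_group = TRUE) %>%
  mutate(cumulative_dep = cumsum(n)) %>%
  # add a row for every department/date combination in the observed range...
  complete(date = seq(min(toy$date), max(toy$date), by = "day"),
           fill = list(n = 0)) %>%
  # ...and carry the last known cumulative count forward
  fill(cumulative_dep, .direction = "down") %>%
  ungroup()

filled
```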
covid_dp <- as.data.frame(covid_dp)
covid_dp$date <- as.Date(covid_dp$date, "%Y-%m-%d")
covid_dp <- covid_dp %>% mutate(BETWEEN0 = as.numeric(difftime(date, lag(date,
1), units = "days")), BETWEEN = ifelse(is.na(BETWEEN0), 0, BETWEEN0), elapsed_time = cumsum(as.numeric(BETWEEN))) %>%
dplyr::select(-c(BETWEEN0, BETWEEN))
write_csv(covid_dp, "data/central_colombia.csv")
covid_relevant_dp <- as.data.frame(covid_relevant_dp)
covid_relevant_dp$date <- as.Date(covid_relevant_dp$date, "%Y-%m-%d")
covid_relevant_dp <- covid_relevant_dp %>% mutate(BETWEEN0 = as.numeric(difftime(date,
lag(date, 1), units = "days")), BETWEEN = ifelse(is.na(BETWEEN0), 0, BETWEEN0),
elapsed_time = cumsum(as.numeric(BETWEEN))) %>% dplyr::select(-c(BETWEEN0,
BETWEEN))
covid_relevant_dp
## date dep n cumulative_dep elapsed_time
## 1 2020-03-06 Bogotá D.C. 1 1 0
## 2 2020-03-09 Antioquia 1 1 3
## 3 2020-03-09 Valle del Cauca 1 1 3
## 4 2020-03-11 Antioquia 3 4 5
## 5 2020-03-11 Bogotá D.C. 2 3 5
## 6 2020-03-12 Bogotá D.C. 2 5 6
## 7 2020-03-13 Bogotá D.C. 1 6 7
## 8 2020-03-13 Valle del Cauca 1 2 7
## 9 2020-03-14 Antioquia 3 7 8
## 10 2020-03-14 Bogotá D.C. 5 11 8
## 11 2020-03-15 Antioquia 1 8 9
## 12 2020-03-15 Bogotá D.C. 8 19 9
## 13 2020-03-15 Cundinamarca 1 1 9
## 14 2020-03-15 Risaralda 1 1 9
## 15 2020-03-15 Valle del Cauca 1 3 9
## 16 2020-03-16 Bogotá D.C. 11 30 10
## 17 2020-03-16 Cundinamarca 1 2 10
## 18 2020-03-17 Bogotá D.C. 9 39 11
## 19 2020-03-17 Valle del Cauca 2 5 11
## 20 2020-03-18 Bogotá D.C. 5 44 12
## 21 2020-03-18 Cundinamarca 2 4 12
## 22 2020-03-18 Risaralda 4 5 12
## 23 2020-03-18 Valle del Cauca 8 13 12
## 24 2020-03-19 Antioquia 3 11 13
## 25 2020-03-19 Bogotá D.C. 8 52 13
## 26 2020-03-19 Cundinamarca 1 5 13
## 27 2020-03-19 Valle del Cauca 1 14 13
## 28 2020-03-20 Antioquia 11 22 14
## 29 2020-03-20 Bogotá D.C. 29 81 14
## 30 2020-03-20 Cundinamarca 3 8 14
## 31 2020-03-20 Risaralda 1 6 14
## 32 2020-03-20 Valle del Cauca 1 15 14
## 33 2020-03-21 Antioquia 3 25 15
## 34 2020-03-21 Bogotá D.C. 6 87 15
## 35 2020-03-21 Cundinamarca 1 9 15
## 36 2020-03-21 Risaralda 2 8 15
## 37 2020-03-21 Valle del Cauca 11 26 15
## 38 2020-03-22 Antioquia 5 30 16
## 39 2020-03-22 Bogotá D.C. 4 91 16
## 40 2020-03-22 Risaralda 5 13 16
## 41 2020-03-22 Valle del Cauca 5 31 16
## 42 2020-03-23 Antioquia 22 52 17
## 43 2020-03-23 Bogotá D.C. 22 113 17
## 44 2020-03-23 Cundinamarca 5 14 17
## 45 2020-03-23 Risaralda 1 14 17
## 46 2020-03-23 Valle del Cauca 6 37 17
## 47 2020-03-24 Bogotá D.C. 48 161 18
## 48 2020-03-24 Cundinamarca 7 21 18
## 49 2020-03-24 Risaralda 3 17 18
## 50 2020-03-24 Valle del Cauca 29 66 18
## 51 2020-03-25 Antioquia 8 60 19
## 52 2020-03-25 Bogotá D.C. 16 177 19
## 53 2020-03-25 Cundinamarca 1 22 19
## 54 2020-03-25 Risaralda 2 19 19
## 55 2020-03-25 Valle del Cauca 5 71 19
## 56 2020-03-26 Bogotá D.C. 7 184 20
## 57 2020-03-26 Valle del Cauca 2 73 20
## 58 2020-03-27 Bogotá D.C. 38 222 21
## 59 2020-03-27 Cundinamarca 1 23 21
## 60 2020-03-28 Antioquia 7 67 22
## 61 2020-03-28 Bogotá D.C. 38 260 22
## 62 2020-03-28 Cundinamarca 2 25 22
## 63 2020-03-28 Valle del Cauca 11 84 22
## 64 2020-03-29 Antioquia 19 86 23
## 65 2020-03-29 Bogotá D.C. 33 293 23
## 66 2020-03-29 Risaralda 10 29 23
## 67 2020-03-29 Valle del Cauca 8 92 23
## 68 2020-03-30 Antioquia 10 96 24
## 69 2020-03-30 Bogotá D.C. 56 349 24
## 70 2020-03-30 Cundinamarca 4 29 24
## 71 2020-03-30 Valle del Cauca 13 105 24
## 72 2020-03-31 Antioquia 5 101 25
## 73 2020-03-31 Bogotá D.C. 41 390 25
## 74 2020-03-31 Cundinamarca 9 38 25
## 75 2020-03-31 Risaralda 6 35 25
## 76 2020-03-31 Valle del Cauca 11 116 25
## 77 2020-04-01 Antioquia 6 107 26
## 78 2020-04-01 Bogotá D.C. 82 472 26
## 79 2020-04-01 Cundinamarca 4 42 26
## 80 2020-04-01 Valle del Cauca 31 147 26
## 81 2020-04-02 Antioquia 20 127 27
## 82 2020-04-02 Bogotá D.C. 70 542 27
## 83 2020-04-02 Valle del Cauca 3 150 27
Missing
I still haven’t integrated the “other part” of the dataset, the one concerning deaths!
Ideas
As far as the predictive model is concerned, I think that we should start with something very simple (e.g. a (log-)linear model) and take it as a baseline.
Then we build something more complex (such as a hierarchical model) and see the improvements with respect to the baseline.
If possible I would put in something Bayesian, since I understood that they really like this kind of stuff!
GAIA: I start here with my stream of consciousness about the analysis!
Let’s fix some terminology (references for these are the first two links of the section “Interesting Links”):
- a “case” is a person who tested positive for Covid-19;
- the “epidemic curve” represents the daily incremental incidence.
Our data are tabulated by date of confirmation by lab test.
The variable we want to predict, say y, (dependent variable) is the (cumulative) number of confirmed cases, namely we want to simulate a (hopefully plausible) epidemic trajectory. Eventually we will project future incidence.
Since our y is a discrete count variable, a linear regression is not appropriate.
The most common choice with count data is to use Poisson regression, which belongs to the GLM class of models.
\[ \ln\lambda_i = \beta_0 + \beta_1 x_i \\ y_i \sim \text{Poisson}(\lambda_i) \]
I think that, as a first step, we should consider the most parsimonious model, taking only time as the independent variable. Here time can be interpreted either as the Fecha de diagnóstico attribute of our dataset, or as the number of days elapsed since the earliest day in the dataset (in the latter case we have to derive this attribute).
Further steps will include more covariates, and a model selection phase should follow. At the moment we are basically working with time series data.
Important: recall that by modelling count data with a Poisson regression we are assuming equidispersion (the mean coincides with the variance); however, we have no guarantee that this is true for our data, and we need to take this into account!
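One way to take this into account is to fit the Poisson GLM and compute the Pearson dispersion statistic: roughly 1 under equidispersion, much larger under overdispersion (in which case a quasi-Poisson or negative binomial model would be safer). A sketch on the values from the cumulative table shown earlier; note that cumulative counts are not independent observations, so this is only indicative:

```r
# cumulative confirmed cases and days elapsed, from the cumulative table
y <- c(1, 3, 8, 10, 13, 21, 34, 46, 58, 82, 99, 144, 169, 197, 259,
       356, 404, 414, 457, 517, 593, 678, 758, 899, 993)
t <- c(0, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,
       18, 19, 20, 21, 22, 23, 24, 25, 26, 27)

# Poisson GLM: log(lambda_t) = beta0 + beta1 * t
fit <- glm(y ~ t, family = poisson(link = "log"))

# Pearson dispersion statistic: ~1 under equidispersion
dispersion <- sum(residuals(fit, type = "pearson")^2) / fit$df.residual
dispersion
```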
The days 7/03/20, 8/03/20, 10/03/20 are missing, probably because no case was detected in those days (consistent with the fact that it was the very beginning of the outbreak).
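This can be checked programmatically by comparing the full calendar range with the dates present (dates copied from the cumulative table shown earlier):

```r
# dates present in the data, from the cumulative table
present <- as.Date(c("2020-03-06", "2020-03-09", "2020-03-11", "2020-03-12",
                     "2020-03-13", "2020-03-14", "2020-03-15", "2020-03-16",
                     "2020-03-17", "2020-03-18", "2020-03-19", "2020-03-20",
                     "2020-03-21", "2020-03-22", "2020-03-23", "2020-03-24",
                     "2020-03-25", "2020-03-26", "2020-03-27", "2020-03-28",
                     "2020-03-29", "2020-03-30", "2020-03-31", "2020-04-01",
                     "2020-04-02"))

# every calendar day in the observed range
all_days <- seq(min(present), max(present), by = "day")

# days with no confirmed case (setdiff drops the Date class, so restore it)
missing_days <- as.Date(setdiff(all_days, present), origin = "1970-01-01")
missing_days
# "2020-03-07" "2020-03-08" "2020-03-10"
```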
cumulative2 <- cumulative %>% mutate(BETWEEN0 = as.numeric(difftime(`Fecha de diagnóstico`,
lag(`Fecha de diagnóstico`, 1))), BETWEEN = ifelse(is.na(BETWEEN0), 0, BETWEEN0),
elapsed_time = cumsum(as.numeric(BETWEEN))) %>% dplyr::select(-c(BETWEEN0,
BETWEEN))
# just for practicality
names(cumulative2) <- c("date", "y", "elapsed_time")
cumulative2
## date y elapsed_time
## 1 2020-03-06 1 0
## 2 2020-03-09 3 3
## 3 2020-03-11 8 5
## 4 2020-03-12 10 6
## 5 2020-03-13 13 7
## 6 2020-03-14 21 8
## 7 2020-03-15 34 9
## 8 2020-03-16 46 10
## 9 2020-03-17 58 11
## 10 2020-03-18 82 12
## 11 2020-03-19 99 13
## 12 2020-03-20 144 14
## 13 2020-03-21 169 15
## 14 2020-03-22 197 16
## 15 2020-03-23 259 17
## 16 2020-03-24 356 18
## 17 2020-03-25 404 19
## 18 2020-03-26 414 20
## 19 2020-03-27 457 21
## 20 2020-03-28 517 22
## 21 2020-03-29 593 23
## 22 2020-03-30 678 24
## 23 2020-03-31 758 25
## 24 2020-04-01 899 26
## 25 2020-04-02 993 27
Description of variables
ID de caso. ID of the confirmed case.
Fecha de diagnóstico. Date that the disease was diagnosed.
Ciudad de ubicación. City where the case was diagnosed.
Departamento o Distrito. Department or district that the city belongs to.
Atención. Situation of the patient: recovered, at home, at the hospital, at the ICU or deceased.
Edad. Age of the confirmed case.
Sexo. Sex of the confirmed case.
Tipo. How the person got infected: in Colombia, abroad or unknown.
País de procedencia. Country of origin in case the person got infected abroad.
Country of origin
Cleaning the “País de procedencia” variable and creating another variable for the continents. As País de procedencia is too fragmented, it is good to group the countries into continents and see if there is a difference in the model.
We clean the column País de procedencia by making sure that there are no cities instead of countries and that multiple countries are separated with a dash (we’ll use this when creating dummy variables).
## [1] "0"
## [2] "Alemania"
## [3] "Alemania - Estambul"
## [4] "Arabia"
## [5] "Argentina"
## [6] "Aruba"
## [7] "Bélgica"
## [8] "Brasil"
## [9] "Canadá"
## [10] "Chile"
## [11] "Colombia"
## [12] "Costa Rica"
## [13] "Croacia"
## [14] "Cuba"
## [15] "Ecuador"
## [16] "Egipto"
## [17] "Emiratos Árabes"
## [18] "España"
## [19] "España - Croacia"
## [20] "España - Croacia - Bosnia"
## [21] "España - Egipto"
## [22] "España - Francia"
## [23] "España - India"
## [24] "España - Italia"
## [25] "España - Turquia"
## [26] "Estados Unidos"
## [27] "Europa"
## [28] "Francia"
## [29] "Francia - Holanda"
## [30] "Grecia"
## [31] "Guatemala"
## [32] "Inglaterra"
## [33] "Isla Martin - Caribe"
## [34] "Islas San Martin"
## [35] "Israel Egipto"
## [36] "Italia"
## [37] "Italia - España - Turquía"
## [38] "Italia - Jamaica - Panamá"
## [39] "Italia - Ucrania - España"
## [40] "Jamaica"
## [41] "Jamaica - Isla Caimán - Panamá"
## [42] "Jamaica - Panamá - Isla Caimán"
## [43] "Jamaica - Panamá - Islas del caribe - Cartagena"
## [44] "Londres"
## [45] "Madrid"
## [46] "Marruecos"
## [47] "México"
## [48] "Panamá"
## [49] "Panamá - Jamaica"
## [50] "Perú"
## [51] "Puerto Rico"
## [52] "República Dominicana"
## [53] "Suiza"
## [54] "Turquía"
## [55] "Turquía - Grecia"
## [56] "Venezuela"
# There is an observation that has a 0 as País de procedencia
which(colombia_covid$`País de procedencia` == "0")
## [1] 911
## # A tibble: 1 x 9
## `ID de caso` `Fecha de diagn~ `Ciudad de ubic~ `Departamento o~
## <int> <date> <chr> <chr>
## 1 911 2020-04-02 Medellín Antioquia
## # ... with 5 more variables: `Atención**` <chr>, Edad <dbl>, Sexo <chr>,
## # tipo <chr>, `País de procedencia` <chr>
# Standardizing the country names
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` ==
"Isla Martin - Caribe"] <- "Islas San Martin"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` ==
"Israel Egipto"] <- "Israel - Egipto"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` ==
"Jamaica - Isla Caimán - Panamá"] <- "Jamaica - Panamá - Isla Caimán"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` ==
"Madrid"] <- "España"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` ==
"Londres"] <- "Inglaterra"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` ==
"Alemania - Estambul"] <- "Alemania - Turquía"
colombia_covid$`País de procedencia`[colombia_covid$`País de procedencia` ==
"Jamaica - Panamá - Islas del caribe - Cartagena"] <- "Jamaica - Panamá - Colombia"
# Creating continents
Europa <- c("Alemania", "Bélgica", "Europa", "Croacia", "España", "España - Croacia",
"España - Croacia - Bosnia", "España - Francia", "España - Italia", "Francia",
"Francia - Holanda", "Grecia", "Inglaterra", "Italia", "Italia - Ucrania - España",
"Suiza")
Asia <- c("Arabia", "Emiratos Árabes", "Turquía")
África <- c("Egipto", "Marruecos")
Norteamérica <- c("Canadá", "Estados Unidos", "México")
Centroamérica <- c("Aruba", "Costa Rica", "Cuba", "Guatemala", "Islas San Martin",
"Jamaica", "Jamaica - Panamá - Isla Caimán", "Jamaica - Panamá - Islas del caribe - Cartagena",
"Panamá", "Panamá - Jamaica", "Puerto Rico", "República Dominicana")
Sudamerica <- c("Argentina", "Brasil", "Chile", "Ecuador", "Perú", "Venezuela")
# Remaining mixed itineraries are mapped to mixed continents:
# 'Alemania - Turquía', 'España - India', 'España - Turquia',
# 'Italia - España - Turquía', 'Turquía - Grecia' -> 'Europa - Asia'
# 'España - Egipto' -> 'Europa - África'
# 'Israel - Egipto' -> 'Asia - África'
# 'Italia - Jamaica - Panamá' -> 'Europa - Centroamérica'
# 'Colombia' -> 'Colombia'
for (i in 1:nrow(colombia_covid)) {
    pais <- colombia_covid$`País de procedencia`[i]
    if (pais %in% Europa) {
        colombia_covid$`Continente de procedencia`[i] <- "Europa"
    } else if (pais %in% Asia) {
        colombia_covid$`Continente de procedencia`[i] <- "Asia"
    } else if (pais %in% África) {
        colombia_covid$`Continente de procedencia`[i] <- "África"
    } else if (pais %in% Norteamérica) {
        colombia_covid$`Continente de procedencia`[i] <- "Norteamérica"
    } else if (pais %in% Centroamérica) {
        colombia_covid$`Continente de procedencia`[i] <- "Centroamérica"
    } else if (pais %in% Sudamerica) {
        colombia_covid$`Continente de procedencia`[i] <- "Sudamerica"
    } else if (pais == "Colombia") {
        colombia_covid$`Continente de procedencia`[i] <- "Colombia"
    } else if (pais %in% c("Alemania - Turquía", "España - India", "España - Turquia",
        "Italia - España - Turquía", "Turquía - Grecia")) {
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Asia"
    } else if (pais == "España - Egipto") {
        colombia_covid$`Continente de procedencia`[i] <- "Europa - África"
    } else if (pais == "Israel - Egipto") {
        colombia_covid$`Continente de procedencia`[i] <- "Asia - África"
    } else if (pais == "Italia - Jamaica - Panamá") {
        colombia_covid$`Continente de procedencia`[i] <- "Europa - Centroamérica"
    }
}
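The row-by-row loop works, but the same mapping can be expressed as one named lookup vector, which is shorter and easier to audit. A sketch with a small subset of the groups defined above (in the notebook the full Europa, Asia, ... vectors would feed the lookup):

```r
# Vectorized alternative to the for-loop: a single named lookup vector,
# indexed by country name. Toy subset of the real mapping.
lookup <- c(setNames(rep("Europa", 2), c("España", "Italia")),
            setNames(rep("Sudamerica", 2), c("Perú", "Venezuela")),
            "Colombia" = "Colombia",
            "Alemania - Turquía" = "Europa - Asia")
paises <- c("Perú", "España", "Alemania - Turquía")
unname(lookup[paises])
# -> "Sudamerica" "Europa" "Europa - Asia"
```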
# Transforming the 0 value into a null value
library(naniar)
colombia_covid <- colombia_covid %>% replace_with_na(replace = list(`País de procedencia` = 0))
# colombia_covid[911,]$`País de procedencia` <- NA
Preparation of the dataset to run the models
Giullia. Since we have to group the data by date to run the models, we need to count the occurrences of each categorical variable per day. That is what I did below, describing each step in detail.
Dropping the variables Ciudad de ubicación, Atención** and Tipo*.
For the variable "Ciudad de ubicación", I decided to keep only the department variable, so the analysis divides the region into departments rather than cities.
For the variable "Atención**", at the moment I do not think it is relevant for predicting the number of confirmed cases, but we can add it later and check.
For the variable "Tipo*", many observations are "in study", so it is better to use only the country of origin (País de procedencia), because the ones "in study" are recorded as Colombia.
colnames(colombia_covid)[8] <- "Tipo*"
covid19 <- dplyr::select(colombia_covid, -c(`Ciudad de ubicación`, `Atención**`,
    `Tipo*`))
Transforming the age values into age ranges so they can be turned into dummy variables
covid19 <- covid19 %>% mutate(Edad = case_when(Edad <= 18 ~ "0 a 18", Edad >=
    19 & Edad <= 30 ~ "19 a 30", Edad >= 31 & Edad <= 45 ~ "31 a 45", Edad >=
    46 & Edad <= 60 ~ "46 a 60", Edad >= 61 & Edad <= 75 ~ "60 a 75", Edad >=
    76 ~ "76+"))  # note: the 61-75 range is labelled "60 a 75"; the label is kept because later code refers to it
Transforming the categorical variables into dummy variables
library(fastDummies)
covid19_dummy <- dummy_cols(covid19, select_columns = c("Departamento o Distrito",
    "Edad", "Sexo", "País de procedencia", "Continente de procedencia"), remove_first_dummy = TRUE,
    ignore_na = TRUE, split = "-", remove_selected_columns = TRUE)
Counting the occurrences of the dummy variables by date, and deleting Fecha de diagnóstico and ID de caso because I will bind the result with the cumulative2 table made by Gaia.
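The counting step itself is not shown as code; a minimal sketch of how the per-day aggregation might look with dplyr, on toy data whose column names mimic the notebook's (assuming the dummy columns are 0/1 indicators):

```r
library(dplyr)

# Sketch of the per-day aggregation described above: sum each 0/1 dummy
# column within each diagnosis date. Toy data for illustration only.
toy <- tibble::tibble(
    `Fecha de diagnóstico` = as.Date(c("2020-03-06", "2020-03-06", "2020-03-07")),
    Sexo_M = c(1, 0, 1),
    `Edad_19 a 30` = c(0, 1, 1)
)
daily <- toy %>%
    group_by(`Fecha de diagnóstico`) %>%
    summarise(across(everything(), sum), .groups = "drop")
daily  # one row per date, with per-day counts of each dummy
```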
Running Poisson
Running poisson with just the variable representing the time as predictor
# Running poisson with just the variable representing the time as predictor
attach(data1)
poisson1 <- glm(y ~ elapsed_time, family = poisson)
summary(poisson1)##
## Call:
## glm(formula = y ~ elapsed_time, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7629 -4.0644 -0.8324 1.5642 6.4039
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.461377 0.052885 46.54 <2e-16 ***
## elapsed_time 0.169656 0.002346 72.32 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 280.62 on 23 degrees of freedom
## AIC: 446.83
##
## Number of Fisher Scoring iterations: 4
The results show that elapsed time is a highly significant predictor.
# checking for overdispersion plotting standardized residuals versus the
# fitted values
pred.pois <- poisson1$fitted.values
res.st <- (y - pred.pois)/sqrt(pred.pois)
plot(pred.pois, res.st)
abline(h = 0, lty = 3, col = "gray75")
According to the plot there is overdispersion: the standardized residuals fall well outside the -1 to 1 range expected under a Poisson model. In addition, the residuals themselves seem to follow a systematic pattern. This is surprising and I do not yet know what it means; I have to research it.
# calculating overdispersion n=25 k=2 n-k=23
est.overdispersion <- sum(res.st^2)/23
est.overdispersion## [1] 11.05917
Extremely high!!!
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.7629 -4.0644 -0.8324 -0.6803 1.5642 6.4039
paste0(c("Null deviance: ", "Residual deviance: "), round(c(poisson1$null.deviance,
deviance(poisson1)), 2))## [1] "Null deviance: 7859.81" "Residual deviance: 280.62"
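The manual computation above (sum of squared standardized residuals divided by n − k) can be wrapped in a small helper so the residual degrees of freedom come from the fitted model instead of being typed by hand. A sketch on simulated data; a formal score test such as `AER::dispersiontest()` would be an alternative, assuming that package is available:

```r
# Helper: Pearson dispersion estimate for any fitted glm.
# Equivalent to sum(res.st^2) / (n - k) computed manually above.
pearson_dispersion <- function(model) {
    sum(residuals(model, type = "pearson")^2) / df.residual(model)
}

# Quick check on simulated overdispersed counts: negative binomial draws
# fitted with a Poisson model should give an estimate well above 1.
set.seed(1)
x <- 1:50
y_sim <- rnbinom(50, mu = exp(1 + 0.05 * x), size = 2)
fit <- glm(y_sim ~ x, family = poisson)
pearson_dispersion(fit)
```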
Stan model just to compare
model.Poisson <- stan_model("stan/poisson_regression.stan")
# arrange things
model.data <- list(N = nrow(data1), cases = data1$y, time = data1$elapsed_time)
# str(model.Poisson.data)
# run the model
fit.model.Poisson <- sampling(model.Poisson, data = model.data)##
## SAMPLING FOR MODEL 'poisson_regression' NOW (CHAIN 1).
## Chain 1:
## Chain 1: Gradient evaluation took 0 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1:
## Chain 1:
## Chain 1: Iteration: 1 / 2000 [ 0%] (Warmup)
## Chain 1: Iteration: 200 / 2000 [ 10%] (Warmup)
## Chain 1: Iteration: 400 / 2000 [ 20%] (Warmup)
## Chain 1: Iteration: 600 / 2000 [ 30%] (Warmup)
## Chain 1: Iteration: 800 / 2000 [ 40%] (Warmup)
## Chain 1: Iteration: 1000 / 2000 [ 50%] (Warmup)
## Chain 1: Iteration: 1001 / 2000 [ 50%] (Sampling)
## Chain 1: Iteration: 1200 / 2000 [ 60%] (Sampling)
## Chain 1: Iteration: 1400 / 2000 [ 70%] (Sampling)
## Chain 1: Iteration: 1600 / 2000 [ 80%] (Sampling)
## Chain 1: Iteration: 1800 / 2000 [ 90%] (Sampling)
## Chain 1: Iteration: 2000 / 2000 [100%] (Sampling)
## Chain 1:
## Chain 1: Elapsed Time: 0.152 seconds (Warm-up)
## Chain 1: 0.149 seconds (Sampling)
## Chain 1: 0.301 seconds (Total)
## Chain 1:
## Inference for Stan model: poisson_regression.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
## mean se_mean sd 2.5% 25% 50% 75% 97.5% n_eff Rhat
## alpha 2.46 0 0.05 2.35 2.42 2.46 2.49 2.56 476 1.01
## beta 0.17 0 0.00 0.17 0.17 0.17 0.17 0.17 507 1.01
##
## Samples were drawn using NUTS(diag_e) at Thu Jul 02 06:00:20 2020.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
y_rep <- as.matrix(fit.model.Poisson, pars = "y_rep")
ppc_dens_overlay(y = model.data$cases, y_rep[1:200, ])
Let's apply a quasi-Poisson model and see what happens
##
## Call:
## glm(formula = y ~ elapsed_time, family = quasipoisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7629 -4.0644 -0.8324 1.5642 6.4039
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.461377 0.175873 13.99 9.69e-13 ***
## elapsed_time 0.169656 0.007801 21.75 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 11.0593)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 280.62 on 23 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
# plotting standardized residuals versus the fitted values
pred.poisq <- poisson1quasi$fitted.values
res.stq <- (y - pred.poisq)/sqrt(summary(poisson1quasi)$dispersion * pred.poisq)
plot(pred.poisq, res.stq)
abline(h = 0, lty = 3, col = "gray75")## [1] 0.9999886
After applying the quasi-Poisson model the dispersion estimate dropped to about 1. Still, the residuals keep showing a systematic pattern; I have to understand why that is.
Running poisson with the variable representing time plus gender
# Running poisson with the variable representing time plus gender
poisson2 <- glm(y ~ elapsed_time + Sexo_M, family = poisson)
summary(poisson2)##
## Call:
## glm(formula = y ~ elapsed_time + Sexo_M, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.7833 -4.0135 -0.7867 1.4556 6.8191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.441994 0.057694 42.327 <2e-16 ***
## elapsed_time 0.172311 0.003911 44.055 <2e-16 ***
## Sexo_M -0.001086 0.001279 -0.849 0.396
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.8 on 24 degrees of freedom
## Residual deviance: 279.9 on 22 degrees of freedom
## AIC: 448.11
##
## Number of Fisher Scoring iterations: 4
Results show that the gender variable is not significant and thus does not change the performance of the model. We were already expecting this, since in the data-visualization step gender was equally distributed across the confirmed cases.
Running poisson with the variable representing time plus the age variables
# Running poisson with the variable representing time plus the age variables
poisson3 <- glm(y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
`Edad_60 a 75` + `Edad_76+`, family = poisson)
summary(poisson3)##
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.471 -2.342 -1.151 1.225 4.598
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.389641 0.061467 38.877 < 2e-16 ***
## elapsed_time 0.168236 0.004012 41.936 < 2e-16 ***
## `Edad_19 a 30` -0.019522 0.005239 -3.726 0.000194 ***
## `Edad_31 a 45` 0.012599 0.002618 4.813 1.49e-06 ***
## `Edad_46 a 60` 0.007874 0.004040 1.949 0.051281 .
## `Edad_60 a 75` -0.031843 0.005307 -6.000 1.97e-09 ***
## `Edad_76+` 0.154509 0.018747 8.242 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 200.75 on 18 degrees of freedom
## AIC: 376.95
##
## Number of Fisher Scoring iterations: 4
Results show that 4 out of the 5 age-range dummies are significant predictors, indicating that age does have an effect on the predicted number of confirmed cases. Furthermore, the AIC decreased compared with the model that has only time as a predictor (poisson1).
Plotting the residuals versus fit and calculating the overdispersion
pred.pois3 <- poisson3$fitted.values
res.st3 <- (y - pred.pois3)/sqrt(pred.pois3)
plot(pred.pois3, res.st3)
abline(h = 0, lty = 3, col = "gray75")
# calculating overdispersion: n = 25, k = 7, n - k = 18
est.overdispersion <- sum(res.st3^2)/18
est.overdispersion## [1] 10.00153
The overdispersion estimate is still high, and again the residuals follow a systematic pattern.
predict.confidence <- function(object, newdata, level = 0.95, ...) {
if (!is(object, "glm")) {
stop("Model should be a glm")
}
if (!is(newdata, "data.frame")) {
stop("Please input a data frame for newdata")
}
if (!is.numeric(level) | level < 0 | level > 1) {
stop("level should be numeric and between 0 and 1")
}
ilink <- family(object)$linkinv
ci.factor <- qnorm(1 - (1 - level)/2)
# calculate CIs:
fit <- predict(object, newdata = newdata, level = level, type = "link",
se.fit = TRUE, ...)
lwr <- ilink(fit$fit - ci.factor * fit$se.fit)
upr <- ilink(fit$fit + ci.factor * fit$se.fit)
df <- data.frame(fit = ilink(fit$fit), lwr = lwr, upr = upr)
return(df)
}
conf.df <- predict.confidence(poisson3, as.data.frame(data1$y))
conf.df## fit lwr upr
## 1 10.69867 9.483496 12.06955
## 2 18.44549 16.642757 20.44350
## 3 23.58488 21.414156 25.97565
## 4 30.69966 28.202107 33.41840
## 5 33.91301 31.241180 36.81334
## 6 39.38810 36.390180 42.63299
## 7 43.25180 39.624844 47.21074
## 8 58.28427 54.096571 62.79615
## 9 63.86224 59.249298 68.83432
## 10 101.87534 94.383113 109.96232
## 11 94.17614 88.682207 100.01043
## 12 98.45408 90.523298 107.07969
## 13 125.57281 117.799392 133.85919
## 14 157.72885 149.090560 166.86765
## 15 224.15214 202.446432 248.18507
## 16 276.14626 255.037191 299.00250
## 17 379.86926 354.915995 406.57693
## 18 354.95697 333.757447 377.50305
## 19 508.91939 482.062562 537.27247
## 20 496.10852 473.114007 520.22063
## 21 568.46889 534.472988 604.62715
## 22 708.40069 670.781153 748.13005
## 23 819.09022 782.791250 857.07242
## 24 917.89085 863.784736 975.38608
## 25 1059.06144 1005.617004 1115.34623
ggplot(data1, aes(elapsed_time, y)) + geom_ribbon(aes(x = elapsed_time, ymin = conf.df$lwr,
ymax = conf.df$upr), data = data1, fill = color_scheme_get("blue")[[2]]) +
geom_line(aes(x = elapsed_time, y = conf.df$fit), data = data1, color = color_scheme_get("blue")[[4]],
size = 1.1) + geom_point(aes(x = elapsed_time, y = y)) + xlab("Days") +
ylab("Total cases") + scale_x_discrete(limit = c(8, 15, 22, 29, 36), labels = c("2-3",
"9-3", "16-3", "23-3", "30-3")) + # facet_wrap('reg', scales ='free')+
theme(strip.text.x = element_text(size = 12, colour = "black"), axis.text.x = element_text(face = "bold",
color = "#993333", angle = 45, size = 9), axis.title.x = element_text(size = 22),
    axis.title.y = element_text(size = 22))
# Let's apply a quasi-Poisson and see what happens
poisson3quasi <- glm(y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
`Edad_60 a 75` + `Edad_76+`, family = quasipoisson)
summary(poisson3quasi)##
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, family = quasipoisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.471 -2.342 -1.151 1.225 4.598
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.389641 0.194392 12.293 3.42e-10 ***
## elapsed_time 0.168236 0.012687 13.260 9.95e-11 ***
## `Edad_19 a 30` -0.019522 0.016569 -1.178 0.2540
## `Edad_31 a 45` 0.012599 0.008278 1.522 0.1454
## `Edad_46 a 60` 0.007874 0.012776 0.616 0.5454
## `Edad_60 a 75` -0.031843 0.016783 -1.897 0.0740 .
## `Edad_76+` 0.154509 0.059290 2.606 0.0179 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 10.00169)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 200.75 on 18 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
When we apply the quasi-Poisson model the age variables are no longer significant. But the Poisson fit suggested they do matter, so I think this is not a problem.
# plotting standardized residuals versus the fitted values
pred.poisq3 <- poisson3quasi$fitted.values
res.stq3 <- (y - pred.poisq3)/sqrt(summary(poisson3quasi)$dispersion * pred.poisq3)
plot(pred.poisq3, res.stq3)
abline(h = 0, lty = 3, col = "gray75")## [1] 0.9999845
By applying the quasi-Poisson model the dispersion estimate dropped to about 1, and again the residuals still follow a systematic pattern.
Running poisson with the variable representing time, age and departments as predictors
# Running poisson with the variable representing time, age and departments
# as predictors
poisson4 <- glm(y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
`Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
`Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
`Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
`Departamento o Distrito_Valle del Cauca`, family = poisson)
summary(poisson4)##
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca`, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.89793 -0.11401 0.01296 0.18169 0.83696
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.73607 0.18896 3.895
## elapsed_time 0.28751 0.02018 14.245
## `Edad_19 a 30` 0.05362 0.02475 2.166
## `Edad_31 a 45` 0.08583 0.02787 3.080
## `Edad_46 a 60` 0.02671 0.03566 0.749
## `Edad_60 a 75` 0.19705 0.15274 1.290
## `Edad_76+` -0.33800 0.43974 -0.769
## `Departamento o Distrito_Bogotá D.C.` -0.10930 0.03963 -2.758
## `Departamento o Distrito_Boyacá` 0.25277 0.73715 0.343
## `Departamento o Distrito_Caldas` 0.46662 0.26875 1.736
## `Departamento o Distrito_Casanare` -2.08700 1.82542 -1.143
## `Departamento o Distrito_Cauca` -0.41716 0.36529 -1.142
## `Departamento o Distrito_Cundinamarca` 0.16497 0.04484 3.679
## `Departamento o Distrito_Meta` -0.27179 0.12534 -2.168
## `Departamento o Distrito_Quindío` 0.20614 0.37399 0.551
## `Departamento o Distrito_Risaralda` -0.38367 0.18819 -2.039
## `Departamento o Distrito_Santander` 0.41024 0.18668 2.198
## `Departamento o Distrito_Tolima` 0.68272 0.26056 2.620
## `Departamento o Distrito_Valle del Cauca` -0.13527 0.05665 -2.388
## Pr(>|z|)
## (Intercept) 9.81e-05 ***
## elapsed_time < 2e-16 ***
## `Edad_19 a 30` 0.030315 *
## `Edad_31 a 45` 0.002073 **
## `Edad_46 a 60` 0.453829
## `Edad_60 a 75` 0.197009
## `Edad_76+` 0.442108
## `Departamento o Distrito_Bogotá D.C.` 0.005820 **
## `Departamento o Distrito_Boyacá` 0.731678
## `Departamento o Distrito_Caldas` 0.082519 .
## `Departamento o Distrito_Casanare` 0.252915
## `Departamento o Distrito_Cauca` 0.253451
## `Departamento o Distrito_Cundinamarca` 0.000234 ***
## `Departamento o Distrito_Meta` 0.030127 *
## `Departamento o Distrito_Quindío` 0.581506
## `Departamento o Distrito_Risaralda` 0.041476 *
## `Departamento o Distrito_Santander` 0.027984 *
## `Departamento o Distrito_Tolima` 0.008789 **
## `Departamento o Distrito_Valle del Cauca` 0.016952 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.8137 on 24 degrees of freedom
## Residual deviance: 3.3026 on 6 degrees of freedom
## AIC: 203.51
##
## Number of Fisher Scoring iterations: 4
The results show that including the department variables reduces the AIC substantially compared with the previous model, which used only time and age as predictors. In addition, 8 out of the 12 department dummy variables are significant (at the 10% level).
# plotting the residuals versus fit to analyze the overdispersion
pred.pois4 <- poisson4$fitted.values
res.st4 <- (y - pred.pois4)/sqrt(pred.pois4)
plot(pred.pois4, res.st4)
abline(h = 0, lty = 3, col = "gray75")
# calculating overdispersion: n = 25, k = 19, n - k = 6
est.overdispersion <- sum(res.st4^2)/6
est.overdispersion## [1] 0.5168401
Now the residuals behave differently and the dispersion estimate has dropped below 1 (about 0.52); under a Poisson model this estimate should be close to 1, so there is no longer evidence of overdispersion. Therefore there is no need for a quasi-Poisson model: the Poisson variance assumption holds.
Running poisson with the variable representing time, age, departments and continent of origin as predictors
# Running poisson with the variable representing time, age, departments and
# continent of origin as predictors
poisson5 <- glm(y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
`Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
`Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
`Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
`Departamento o Distrito_Valle del Cauca` + `Continente de procedencia_Asia` +
`Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
`Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
`Continente de procedencia_Sudamerica`, family = poisson)
summary(poisson5)##
## Call:
## glm(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca` + `Continente de procedencia_Asia` +
## `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
## `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
## `Continente de procedencia_Sudamerica`, family = poisson)
##
## Deviance Residuals:
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [24] 0 0
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) -0.28738 1.72656 -0.166
## elapsed_time 0.35366 0.09012 3.924
## `Edad_19 a 30` 0.33992 1.29758 0.262
## `Edad_31 a 45` 0.28653 0.73462 0.390
## `Edad_46 a 60` 0.27472 1.30079 0.211
## `Edad_60 a 75` 0.14618 1.61963 0.090
## `Edad_76+` 1.20137 4.83430 0.249
## `Departamento o Distrito_Bogotá D.C.` -0.01024 0.66849 -0.015
## `Departamento o Distrito_Boyacá` -1.76609 5.65082 -0.313
## `Departamento o Distrito_Caldas` -1.97829 8.25311 -0.240
## `Departamento o Distrito_Casanare` 5.30073 26.46443 0.200
## `Departamento o Distrito_Cauca` 1.17006 5.91217 0.198
## `Departamento o Distrito_Cundinamarca` -0.11854 1.10533 -0.107
## `Departamento o Distrito_Meta` -0.09535 0.59699 -0.160
## `Departamento o Distrito_Quindío` -1.00160 3.58050 -0.280
## `Departamento o Distrito_Risaralda` 0.17522 2.74291 0.064
## `Departamento o Distrito_Santander` 0.12443 3.55810 0.035
## `Departamento o Distrito_Tolima` 0.57417 3.82009 0.150
## `Departamento o Distrito_Valle del Cauca` -0.15165 0.63455 -0.239
## `Continente de procedencia_Asia` -0.99501 2.85962 -0.348
## `Continente de procedencia_Centroamérica` -0.05926 0.59201 -0.100
## `Continente de procedencia_Colombia` -0.31027 1.63761 -0.189
## `Continente de procedencia_Europa` -0.04230 1.34447 -0.031
## `Continente de procedencia_Norteamérica` -0.08012 0.54016 -0.148
## `Continente de procedencia_Sudamerica` -0.63692 1.26602 -0.503
## Pr(>|z|)
## (Intercept) 0.868
## elapsed_time 8.7e-05 ***
## `Edad_19 a 30` 0.793
## `Edad_31 a 45` 0.697
## `Edad_46 a 60` 0.833
## `Edad_60 a 75` 0.928
## `Edad_76+` 0.804
## `Departamento o Distrito_Bogotá D.C.` 0.988
## `Departamento o Distrito_Boyacá` 0.755
## `Departamento o Distrito_Caldas` 0.811
## `Departamento o Distrito_Casanare` 0.841
## `Departamento o Distrito_Cauca` 0.843
## `Departamento o Distrito_Cundinamarca` 0.915
## `Departamento o Distrito_Meta` 0.873
## `Departamento o Distrito_Quindío` 0.780
## `Departamento o Distrito_Risaralda` 0.949
## `Departamento o Distrito_Santander` 0.972
## `Departamento o Distrito_Tolima` 0.881
## `Departamento o Distrito_Valle del Cauca` 0.811
## `Continente de procedencia_Asia` 0.728
## `Continente de procedencia_Centroamérica` 0.920
## `Continente de procedencia_Colombia` 0.850
## `Continente de procedencia_Europa` 0.975
## `Continente de procedencia_Norteamérica` 0.882
## `Continente de procedencia_Sudamerica` 0.615
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7.8598e+03 on 24 degrees of freedom
## Residual deviance: 3.2641e-13 on 0 degrees of freedom
## AIC: 212.2
##
## Number of Fisher Scoring iterations: 3
The AIC actually increased after adding the continent-of-origin variables, and none of them is significant. Note also that with 25 parameters for 25 observations the model is saturated (residual deviance is essentially 0 on 0 degrees of freedom), so the individual estimates are not reliable.
# Let's confirm if the continent of origin variable is not significant at
# all when being a predictor just together with the time variable.
poisson6 <- glm(y ~ elapsed_time + `Continente de procedencia_Asia` + `Continente de procedencia_Centroamérica` +
`Continente de procedencia_Colombia` + `Continente de procedencia_Europa` +
`Continente de procedencia_Norteamérica` + `Continente de procedencia_Sudamerica`,
family = poisson)
summary(poisson6)##
## Call:
## glm(formula = y ~ elapsed_time + `Continente de procedencia_Asia` +
## `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
## `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
## `Continente de procedencia_Sudamerica`, family = poisson)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.178 -3.466 -1.204 1.988 6.397
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 2.2264438 0.0784497 28.381
## elapsed_time 0.1774725 0.0040070 44.291
## `Continente de procedencia_Asia` -0.0372645 0.0099577 -3.742
## `Continente de procedencia_Centroamérica` -0.0040363 0.0065343 -0.618
## `Continente de procedencia_Colombia` 0.0004037 0.0010424 0.387
## `Continente de procedencia_Europa` 0.0118714 0.0054906 2.162
## `Continente de procedencia_Norteamérica` -0.0073427 0.0052240 -1.406
## `Continente de procedencia_Sudamerica` 0.0264740 0.0114359 2.315
## Pr(>|z|)
## (Intercept) < 2e-16 ***
## elapsed_time < 2e-16 ***
## `Continente de procedencia_Asia` 0.000182 ***
## `Continente de procedencia_Centroamérica` 0.536766
## `Continente de procedencia_Colombia` 0.698530
## `Continente de procedencia_Europa` 0.030606 *
## `Continente de procedencia_Norteamérica` 0.159849
## `Continente de procedencia_Sudamerica` 0.020613 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 7859.81 on 24 degrees of freedom
## Residual deviance: 215.47 on 17 degrees of freedom
## AIC: 393.67
##
## Number of Fisher Scoring iterations: 4
In this case the AIC decreased compared with the model using only time as a predictor (poisson1), but only 3 of the 6 continent variables are significant, so we think continent of origin is not a relevant predictor to add to the model with the lowest AIC so far (poisson4).
Applying ANOVA to compare the poisson models
We decided to compare poisson1quasi, poisson3quasi and poisson4, because they are nested and we want to see whether model 4 is in fact better than the first model.
## Analysis of Deviance Table
##
## Model 1: y ~ elapsed_time
## Model 2: y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
## `Edad_60 a 75` + `Edad_76+`
## Model 3: y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
## `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca`
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 23 280.624
## 2 18 200.747 5 79.878 8.901e-16 ***
## 3 6 3.303 12 197.444 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The poisson4 model is the best: it gives the lowest p-value in the deviance table.
Two important observations, Angela.
Is it OK to compare a Poisson and a quasi-Poisson model in the same table?
The problem I mentioned is that the table only reports the p-value and is missing the other test statistics. If you go to the comparison of the negative binomial models you will see that it works properly there, so it seems to be something related to the Poisson models on our data. Comparing only Poisson models, or only quasi-Poisson models, gives the same problem.
Running Negative Binomial
Running negative binomial with just the variable representing the time as predictor
# Running negative binomial with just the variable representing the time as
# predictor
nb1 <- glm.nb(y ~ elapsed_time)
summary(nb1)##
## Call:
## glm.nb(formula = y ~ elapsed_time, init.theta = 11.25073154,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.81454 -1.19152 -0.02362 0.81855 1.41660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.518348 0.169272 8.97 <2e-16 ***
## elapsed_time 0.219689 0.009522 23.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(11.2507) family taken to be 1)
##
## Null deviance: 477.963 on 24 degrees of freedom
## Residual deviance: 27.849 on 23 degrees of freedom
## AIC: 259.91
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 11.25
## Std. Err.: 3.87
##
## 2 x log-likelihood: -253.908
The predictor is significant, and the AIC is substantially lower than that of the Poisson model fitted with the same predictor.
# Plotting standardized residuals vs fitted values
stdres <- rstandard(nb1)
plot(nb1$fitted.values, stdres)
abline(h = 0, col = "gray")
The standardized residuals exceed the -1 to 1 range only at the lower bound, by approximately one unit.
# Calculating overdispersion n=25 k=2 n-k=23
est.overdispersion <- sum(stdres^2)/23
est.overdispersion## [1] 1.33484
The estimated overdispersion is a bit higher than 1, but I think it is still an acceptable value. Gioia said that values up to about 2 can still be considered compatible with the dispersion assumption.
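An alternative way to estimate the dispersion is the Pearson chi-squared statistic divided by the residual degrees of freedom; values near 1 suggest the variance assumption is adequate. A hedged sketch on simulated data (the variable names here are illustrative, not the project's own):

```r
# Dispersion check via Pearson residuals (hedged sketch, simulated data).
library(MASS)
set.seed(1)
day <- 1:25
y_sim <- rnbinom(25, mu = exp(1 + 0.2 * day), size = 10)
fit <- glm.nb(y_sim ~ day)
# Pearson chi-squared / residual df; close to 1 is what we hope for
phi <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
phi
```

This is essentially the same quantity as the standardized-residual version above, just computed from `residuals(fit, type = "pearson")` directly.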
Running negative binomial with the variable representing time plus the age variables as predictors
# Running negative binomial with the variable representing time plus the age
# variables as predictors
nb2 <- glm.nb(y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
`Edad_60 a 75` + `Edad_76+`)
summary(nb2)##
## Call:
## glm.nb(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`, init.theta = 12.57832368,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.82305 -1.19078 -0.05259 0.56987 1.86010
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.487607 0.194140 7.663 1.82e-14 ***
## elapsed_time 0.225048 0.017409 12.927 < 2e-16 ***
## `Edad_19 a 30` 0.005567 0.023328 0.239 0.811
## `Edad_31 a 45` -0.002274 0.014144 -0.161 0.872
## `Edad_46 a 60` 0.002383 0.022039 0.108 0.914
## `Edad_60 a 75` -0.032190 0.027931 -1.152 0.249
## `Edad_76+` 0.061832 0.087671 0.705 0.481
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(12.5783) family taken to be 1)
##
## Null deviance: 526.498 on 24 degrees of freedom
## Residual deviance: 28.148 on 18 degrees of freedom
## AIC: 267.94
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 12.58
## Std. Err.: 4.41
##
## 2 x log-likelihood: -251.944
The AIC increased compared with the previous model (nb1), and the age variables are not significant. That is odd, because in the Poisson model they are.
Running negative binomial with the variable representing time plus the departments variables as predictors
# Running negative binomial with the variable representing time plus the
# departments variables as predictors
nb3 <- glm.nb(y ~ elapsed_time + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
`Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
`Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
`Departamento o Distrito_Valle del Cauca`)
summary(nb3)##
## Call:
## glm.nb(formula = y ~ elapsed_time + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca`, init.theta = 82.27821159,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5375 -0.9716 -0.1456 0.7481 1.9049
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1.038762 0.177075 5.866
## elapsed_time 0.275143 0.016572 16.603
## `Departamento o Distrito_Bogotá D.C.` -0.019166 0.004033 -4.752
## `Departamento o Distrito_Boyacá` -0.447171 0.140811 -3.176
## `Departamento o Distrito_Caldas` -0.008610 0.086957 -0.099
## `Departamento o Distrito_Casanare` -0.111260 0.193131 -0.576
## `Departamento o Distrito_Cauca` 0.101655 0.048756 2.085
## `Departamento o Distrito_Cundinamarca` 0.109295 0.034207 3.195
## `Departamento o Distrito_Meta` -0.088358 0.049785 -1.775
## `Departamento o Distrito_Quindío` -0.036409 0.048346 -0.753
## `Departamento o Distrito_Risaralda` 0.058404 0.062973 0.927
## `Departamento o Distrito_Santander` -0.122086 0.153665 -0.794
## `Departamento o Distrito_Tolima` 0.073910 0.093845 0.788
## `Departamento o Distrito_Valle del Cauca` -0.011913 0.012701 -0.938
## Pr(>|z|)
## (Intercept) 4.46e-09 ***
## elapsed_time < 2e-16 ***
## `Departamento o Distrito_Bogotá D.C.` 2.01e-06 ***
## `Departamento o Distrito_Boyacá` 0.00149 **
## `Departamento o Distrito_Caldas` 0.92112
## `Departamento o Distrito_Casanare` 0.56456
## `Departamento o Distrito_Cauca` 0.03707 *
## `Departamento o Distrito_Cundinamarca` 0.00140 **
## `Departamento o Distrito_Meta` 0.07594 .
## `Departamento o Distrito_Quindío` 0.45140
## `Departamento o Distrito_Risaralda` 0.35369
## `Departamento o Distrito_Santander` 0.42691
## `Departamento o Distrito_Tolima` 0.43094
## `Departamento o Distrito_Valle del Cauca` 0.34825
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(82.2785) family taken to be 1)
##
## Null deviance: 2214.864 on 24 degrees of freedom
## Residual deviance: 26.185 on 11 degrees of freedom
## AIC: 247.44
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 82.3
## Std. Err.: 36.2
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -217.438
The AIC decreased compared with the first model (nb1), but only slightly. In addition, just 5 of the 12 department dummy variables are significant.
Running negative binomial with the variable representing time plus the continent of origin variables as predictors
# Running negative binomial with the variable representing time plus the
# continent of origin variables as predictors
nb4 <- glm.nb(y ~ elapsed_time + `Continente de procedencia_Asia` + `Continente de procedencia_Centroamérica` +
`Continente de procedencia_Colombia` + `Continente de procedencia_Europa` +
`Continente de procedencia_Norteamérica` + `Continente de procedencia_Sudamerica`)
summary(nb4)##
## Call:
## glm.nb(formula = y ~ elapsed_time + `Continente de procedencia_Asia` +
## `Continente de procedencia_Centroamérica` + `Continente de procedencia_Colombia` +
## `Continente de procedencia_Europa` + `Continente de procedencia_Norteamérica` +
## `Continente de procedencia_Sudamerica`, init.theta = 22.72556147,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5410 -0.7776 0.1004 0.4496 2.2821
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 1.064167 0.206173 5.162
## elapsed_time 0.239010 0.013619 17.550
## `Continente de procedencia_Asia` -0.018862 0.044232 -0.426
## `Continente de procedencia_Centroamérica` -0.041791 0.029116 -1.435
## `Continente de procedencia_Colombia` -0.007457 0.004613 -1.617
## `Continente de procedencia_Europa` 0.047182 0.017591 2.682
## `Continente de procedencia_Norteamérica` -0.001864 0.023238 -0.080
## `Continente de procedencia_Sudamerica` -0.018870 0.043695 -0.432
## Pr(>|z|)
## (Intercept) 2.45e-07 ***
## elapsed_time < 2e-16 ***
## `Continente de procedencia_Asia` 0.66979
## `Continente de procedencia_Centroamérica` 0.15119
## `Continente de procedencia_Colombia` 0.10595
## `Continente de procedencia_Europa` 0.00731 **
## `Continente de procedencia_Norteamérica` 0.93606
## `Continente de procedencia_Sudamerica` 0.66585
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(22.7256) family taken to be 1)
##
## Null deviance: 864.028 on 24 degrees of freedom
## Residual deviance: 23.798 on 17 degrees of freedom
## AIC: 254.17
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 22.73
## Std. Err.: 7.59
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -236.169
The AIC decreased compared with the first model (nb1), but only slightly. In addition, just one of the 6 continent dummy variables is significant.
Running negative binomial with the variables representing time, age and departments as predictors
# Running negative binomial with the variable representing time, age and
# departments as predictors
nb5 <- glm.nb(y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` +
`Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` +
`Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` +
`Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` +
`Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
`Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
`Departamento o Distrito_Valle del Cauca`)
summary(nb5)##
## Call:
## glm.nb(formula = y ~ elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` +
## `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` +
## `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` +
## `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` +
## `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` +
## `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` +
## `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` +
## `Departamento o Distrito_Valle del Cauca`, init.theta = 2453700.017,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.89792 -0.11401 0.01296 0.18169 0.83694
##
## Coefficients:
## Estimate Std. Error z value
## (Intercept) 0.73607 0.18896 3.895
## elapsed_time 0.28751 0.02018 14.245
## `Edad_19 a 30` 0.05362 0.02475 2.166
## `Edad_31 a 45` 0.08583 0.02787 3.079
## `Edad_46 a 60` 0.02671 0.03566 0.749
## `Edad_60 a 75` 0.19705 0.15274 1.290
## `Edad_76+` -0.33801 0.43975 -0.769
## `Departamento o Distrito_Bogotá D.C.` -0.10930 0.03963 -2.758
## `Departamento o Distrito_Boyacá` 0.25278 0.73716 0.343
## `Departamento o Distrito_Caldas` 0.46662 0.26875 1.736
## `Departamento o Distrito_Casanare` -2.08704 1.82545 -1.143
## `Departamento o Distrito_Cauca` -0.41717 0.36530 -1.142
## `Departamento o Distrito_Cundinamarca` 0.16497 0.04484 3.679
## `Departamento o Distrito_Meta` -0.27178 0.12534 -2.168
## `Departamento o Distrito_Quindío` 0.20615 0.37400 0.551
## `Departamento o Distrito_Risaralda` -0.38367 0.18819 -2.039
## `Departamento o Distrito_Santander` 0.41024 0.18669 2.197
## `Departamento o Distrito_Tolima` 0.68273 0.26057 2.620
## `Departamento o Distrito_Valle del Cauca` -0.13528 0.05665 -2.388
## Pr(>|z|)
## (Intercept) 9.81e-05 ***
## elapsed_time < 2e-16 ***
## `Edad_19 a 30` 0.030317 *
## `Edad_31 a 45` 0.002074 **
## `Edad_46 a 60` 0.453827
## `Edad_60 a 75` 0.197007
## `Edad_76+` 0.442100
## `Departamento o Distrito_Bogotá D.C.` 0.005820 **
## `Departamento o Distrito_Boyacá` 0.731667
## `Departamento o Distrito_Caldas` 0.082520 .
## `Departamento o Distrito_Casanare` 0.252912
## `Departamento o Distrito_Cauca` 0.253449
## `Departamento o Distrito_Cundinamarca` 0.000234 ***
## `Departamento o Distrito_Meta` 0.030131 *
## `Departamento o Distrito_Quindío` 0.581496
## `Departamento o Distrito_Risaralda` 0.041478 *
## `Departamento o Distrito_Santander` 0.027985 *
## `Departamento o Distrito_Tolima` 0.008789 **
## `Departamento o Distrito_Valle del Cauca` 0.016953 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2453705) family taken to be 1)
##
## Null deviance: 7858.9108 on 24 degrees of freedom
## Residual deviance: 3.3026 on 6 degrees of freedom
## AIC: 205.51
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2453700
## Std. Err.: 24897176
## Warning while fitting theta: alternation limit reached
##
## 2 x log-likelihood: -165.509
When the age and department variables are both included in the model, the AIC falls below that of every previous negative binomial model.
# Plotting standardized residuals vs fitted values
stdres <- rstandard(nb5)
plot(nb5$fitted.values, stdres)
abline(h = 0, col = "gray")
# Calculating overdispersion: n = 25, k = 19, n - k = 6
est.overdispersion <- sum(stdres^2)/6
est.overdispersion## [1] 3.853961
Here the estimated overdispersion is a bit high. Angela, do you know whether the negative binomial has the same dispersion checks as the Poisson? Namely:
The estimated overdispersion has to be close to one.
The standardized residuals have to lie roughly between -1 and 1.
Applying ANOVA to compare the negative binomial models
We compare nb1, nb2 and nb5 because they are nested, and we are mainly interested in whether nb5 is in fact better than the first model.
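For `glm.nb` fits, `anova()` from MASS returns likelihood-ratio tests (with theta re-estimated for each model), which is the kind of table shown below. A hedged, self-contained sketch on simulated data (names and data are illustrative):

```r
# Hedged sketch (not the original chunk): likelihood-ratio comparison of
# nested glm.nb fits, on simulated counts so the example runs alone.
library(MASS)
set.seed(7)
day <- 1:25
x <- rnorm(25)
y_sim <- rnbinom(25, mu = exp(1 + 0.2 * day), size = 8)
nb_small <- glm.nb(y_sim ~ day)
nb_large <- glm.nb(y_sim ~ day + x)
# Prints "Likelihood ratio tests of Negative Binomial Models"
anova(nb_small, nb_large)
```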
## Likelihood ratio tests of Negative Binomial Models
##
## Response: y
## Model
## 1 elapsed_time
## 2 elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+`
## 3 elapsed_time + `Edad_19 a 30` + `Edad_31 a 45` + `Edad_46 a 60` + `Edad_60 a 75` + `Edad_76+` + `Departamento o Distrito_Bogotá D.C.` + `Departamento o Distrito_Boyacá` + `Departamento o Distrito_Caldas` + `Departamento o Distrito_Casanare` + `Departamento o Distrito_Cauca` + `Departamento o Distrito_Cundinamarca` + `Departamento o Distrito_Meta` + `Departamento o Distrito_Quindío` + `Departamento o Distrito_Risaralda` + `Departamento o Distrito_Santander` + `Departamento o Distrito_Tolima` + `Departamento o Distrito_Valle del Cauca`
## theta Resid. df 2 x log-lik. Test df LR stat.
## 1 1.125073e+01 23 -253.9080
## 2 1.257832e+01 18 -251.9442 1 vs 2 5 1.963751
## 3 2.453700e+06 6 -165.5091 2 vs 3 12 86.435145
## Pr(Chi)
## 1
## 2 8.541378e-01
## 3 2.409184e-13
The fifth model is definitely the best: it is the one with the lowest p-value.
Preliminary conclusions
For the Poisson family, the model with the lowest AIC (203.51) includes three groups of predictors: time, age and departments.
For the negative binomial, the model with the lowest AIC (205.51) includes the same three groups of predictors as the best Poisson model.
Predictive accuracy
Calculating the predictive accuracy of the best Poisson model and the best negative binomial model.
Predictive accuracy of the Poisson model
Predicting with a 95% confidence interval
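The interval table below was presumably obtained along these lines: Wald limits computed for the linear predictor and then exponentiated back to the count scale. This is a hedged reconstruction, not the original chunk, with simulated data and illustrative names:

```r
# Hedged sketch: 95% Wald intervals for a Poisson GLM on the response scale.
set.seed(3)
day <- 1:25
y_sim <- rpois(25, exp(0.5 + 0.15 * day))
fit <- glm(y_sim ~ day, family = poisson)
# Standard errors on the link (log) scale, then exponentiate the limits
p <- predict(fit, se.fit = TRUE, type = "link")
conf.sketch <- data.frame(fit = exp(p$fit),
                          lwr = exp(p$fit - 1.96 * p$se.fit),
                          upr = exp(p$fit + 1.96 * p$se.fit))
head(conf.sketch)
```

Computing the limits on the link scale before exponentiating keeps both bounds positive, which a symmetric interval on the response scale would not guarantee.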
## fit lwr upr
## 1 1.974648 1.332991 2.925179
## 2 4.834856 3.574338 6.539907
## 3 8.990605 6.638468 12.176149
## 4 11.180386 8.280195 15.096388
## 5 12.300336 8.491267 17.818102
## 6 19.821523 16.768624 23.430233
## 7 32.147925 23.952591 43.147276
## 8 40.554542 32.533719 50.552810
## 9 56.442432 43.971719 72.449934
## 10 79.443730 64.078012 98.494102
## 11 106.202123 89.950458 125.390034
## 12 144.907375 123.469566 170.067393
## 13 166.649056 143.604165 193.392076
## 14 196.542355 170.935542 225.985170
## 15 258.791545 229.143210 292.276013
## 16 358.155418 323.051083 397.074365
## 17 401.011234 363.835748 441.985183
## 18 417.425361 379.504684 459.135128
## 19 459.209857 419.162388 503.083528
## 20 516.458369 473.793775 562.964861
## 21 594.837208 549.008150 644.491897
## 22 675.890350 626.889835 728.720965
## 23 757.917923 705.838598 813.839850
## 24 899.457645 842.574303 960.181258
## 25 991.853197 932.047559 1055.496316
Calculating the proportion of times the true value of y falls inside the model's 95% confidence interval. (This is a kind of accuracy measure, I suppose, right?)
n <- 25
freq_coverage <- sum(y >= conf.df[, 2] & y <= conf.df[, 3])
freq_coverage <- freq_coverage/n
freq_coverage## [1] 0.92
We get 92% coverage! I think this is very good, even if it is computed on the training data, because even there this is not guaranteed to happen.
# plotting the observed y (blue dots) versus the confidence interval (gray
# dots)
plot(y, col = "blue")
points(conf.df$lwr, col = "gray")
points(conf.df$upr, col = "gray")
# Plotting the prediction line (solid blue) with a 95% confidence band
# (shaded blue) together with the observed values (black dots)
ggplot(data1, aes(elapsed_time, y)) + geom_ribbon(aes(x = elapsed_time, ymin = conf.df$lwr,
ymax = conf.df$upr), data = data1, fill = color_scheme_get("blue")[[2]]) +
geom_line(aes(x = elapsed_time, y = conf.df$fit), data = data1, color = color_scheme_get("blue")[[4]],
size = 1.1) + geom_point(aes(x = elapsed_time, y = y)) + xlab("Days") +
ylab("Total cases") + scale_x_discrete(limit = c(8, 15, 22, 29, 36), labels = c("2-3",
"9-3", "16-3", "23-3", "30-3")) + # facet_wrap('reg', scales ='free')+
theme(strip.text.x = element_text(size = 12, colour = "black"), axis.text.x = element_text(face = "bold",
color = "#993333", angle = 45, size = 9), axis.title.x = element_text(size = 22),
    axis.title.y = element_text(size = 22))
It looks like an almost perfect model!
Now let's perform a leave-one-out cross-validation and see what result we get. (Note: I did not find a way to predict with a 95% interval under cross-validation, so I am just calculating the overall prediction error.)
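One way to obtain a pair of numbers like those reported below is `boot::cv.glm` with the default `K = n` (leave-one-out); its `$delta` component holds the raw and the bias-adjusted cross-validation estimates of prediction error. A hedged sketch on simulated data, not the original chunk:

```r
# Hedged sketch: leave-one-out CV error for a Poisson GLM via boot::cv.glm.
library(boot)
set.seed(5)
d <- data.frame(day = 1:25)
d$y <- rpois(25, exp(0.5 + 0.15 * d$day))
fit <- glm(y ~ day, family = poisson, data = d)
# Default cost is mean squared error; default K = n gives leave-one-out
cv.glm(d, fit)$delta   # c(raw estimate, bias-adjusted estimate)
```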
## [1] 130359.2 125147.4
I am not able to interpret this error number yet. Is it too high? I will look into it tomorrow. I know the value on the left is the raw cross-validation estimate of the prediction error and the one on the right is the bias-adjusted estimate.
I ran leave-one-out cross-validation again with the other two Poisson models.
## [1] 6993.699 6776.927
## [1] 4144.633 4099.331
So if you compare poisson4 with these other two models, you can see that the test error increases as more variables are added. Should we then choose the model with just time as predictor (poisson1quasi)? Even though the model with many predictors has a better AIC and statistically significant variables? I don't know what to do.
Predictive accuracy of the Negative Binomial model
Predicting with a 95% confidence interval
## fit lwr upr
## 1 1.974639 1.332980 2.925175
## 2 4.834841 3.574319 6.539901
## 3 8.990565 6.638413 12.176140
## 4 11.180353 8.280153 15.096376
## 5 12.300362 8.491258 17.818195
## 6 19.821493 16.768580 23.430225
## 7 32.147934 23.952554 43.147368
## 8 40.554579 32.533683 50.552956
## 9 56.442454 43.971612 72.450166
## 10 79.443718 64.077780 98.494429
## 11 106.202242 89.950276 125.390569
## 12 144.907422 123.469042 170.068227
## 13 166.648983 143.603391 193.392951
## 14 196.542333 170.934568 225.986407
## 15 258.791526 229.141725 292.277863
## 16 358.155670 323.048905 397.077600
## 17 401.010827 363.832498 441.988235
## 18 417.425852 379.502115 459.139317
## 19 459.210217 419.159167 503.088182
## 20 516.458259 473.789373 562.969852
## 21 594.837596 549.003210 644.498538
## 22 675.889823 626.882851 728.727947
## 23 757.917904 705.830820 813.848777
## 24 899.457798 842.564370 960.192905
## 25 991.852774 932.035450 1055.509129
Calculating the proportion of times the true value of y falls inside the model's 95% confidence interval. (This is a kind of accuracy measure, I suppose, right?)
n <- 25
freq_coverage <- sum(y >= conf.df[, 2] & y <= conf.df[, 3])
freq_coverage <- freq_coverage/n
freq_coverage## [1] 0.92
We also obtain 92% coverage for the negative binomial model.
# plotting the observed y (blue dots) versus the confidence interval (gray
# dots)
plot(y, col = "blue")
points(conf.df$lwr, col = "gray")
points(conf.df$upr, col = "gray")
# Plotting the prediction line (solid blue) with a 95% confidence band
# (shaded blue) together with the observed values (black dots)
ggplot(data1, aes(elapsed_time, y)) + geom_ribbon(aes(x = elapsed_time, ymin = conf.df$lwr,
ymax = conf.df$upr), data = data1, fill = color_scheme_get("blue")[[2]]) +
geom_line(aes(x = elapsed_time, y = conf.df$fit), data = data1, color = color_scheme_get("blue")[[4]],
size = 1.1) + geom_point(aes(x = elapsed_time, y = y)) + xlab("Days") +
ylab("Total cases") + scale_x_discrete(limit = c(8, 15, 22, 29, 36), labels = c("2-3",
"9-3", "16-3", "23-3", "30-3")) + # facet_wrap('reg', scales ='free')+
theme(strip.text.x = element_text(size = 12, colour = "black"), axis.text.x = element_text(face = "bold",
color = "#993333", angle = 45, size = 9), axis.title.x = element_text(size = 22),
    axis.title.y = element_text(size = 22))
Also a very precise model.
Now let's perform a leave-one-out cross-validation and see what result we get. (Note: as before, I did not find a way to predict with a 95% interval under cross-validation, so I am just calculating the overall prediction error.)
## [1] 130356.7 125145.0
Approximately the same result as the Poisson model.
I ran leave-one-out cross-validation again with the other two negative binomial models.
## [1] 119035.8 116020.8
## [1] 56611.84 56065.30
The same thing happens as with the Poisson: comparing nb5 with these other two models, the test error increases as more variables are added. So again, should we choose the model with just time as predictor (nb1)? Even though the model with many predictors has a better AIC and statistically significant variables? Again, I don't know what to do.
Final conclusions
I thought the predictive-accuracy comparison would let us choose between the Poisson and the negative binomial, but it did not: they perform equally well, both for the 95% confidence intervals on the training data and in leave-one-out cross-validation. And now there is the problem of overfitting. So what do we do now?
Interesting links
http://freerangestats.info/blog/2020/05/09/covid-population-incidence